So you want to know the range of products being sold by your competitor. You go to their website and see all the products (along with the details) and want to compare it with your own range of products. Great! How do you do that? How do you get the details available on the website into a format in which you can analyse it?
Hmmm.. If you have these or similar questions on your mind, you have come to the right place. In this post, we will learn about web scraping using R. If you like a more structured approach, try our free online course, Web Scraping with R.
What exactly is web scraping (also called web mining or web harvesting)? It is a technique for extracting data from websites. Remember, websites contain a wealth of useful data, but they are designed for human consumption, not data analysis. The goal of web scraping is to take advantage of the pattern or structure of web pages to extract and store data in a format suitable for data analysis.
Now, let us understand why we may have to scrape data from the web.
Below are a few use cases of web scraping:
To be able to scrape data from websites, we need to understand how the web pages are structured. In this section, we will learn just enough HTML to be able to start scraping data from websites.
A web page typically is made up of the following:
An HTML element consists of a start tag and an end tag with content inserted in between. Elements can be nested, and tag names are case insensitive.
The DOM (Document Object Model) defines the logical structure of a document and the way it is accessed and manipulated. In the above image, you can see that the HTML is structured as a tree, and you can trace a path to any node or tag. We will use a similar approach in our case studies.
The class attribute is used to apply the same style to all elements that share a class name: HTML elements with the same class name will have the same format and style. The id attribute specifies a unique id for an HTML element; it can be used on any HTML element and is case sensitive. The style attribute sets the style of an individual HTML element.
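To make these ideas concrete, here is a tiny, invented HTML snippet parsed with the xml2 package; the tag, class, and id names below are made up for illustration:

```r
library(xml2)

# a minimal, invented HTML document
page <- read_html('
  <div class="product" id="item-1" style="color: blue;">
    <a href="/phone">Apple iPhone (Black)</a>
  </div>
')

# walk the DOM tree: find the hyperlink inside the div with class "product"
node <- xml_find_first(page, "//div[@class='product']/a")
xml_text(node)
```

The same tree-walking idea, expressed through CSS selectors instead of XPath, is what we will use with rvest below.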
We will be using the following R packages in this tutorial.
library(robotstxt)
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(forcats)
library(magrittr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tibble)
library(purrr)
In this first case study, we will scrape the details of best selling smart phones from Amazon. Our goal is to extract the following:
As mentioned earlier, we will first check if we can scrape data from the web page using paths_allowed() from the robotstxt package. We need to specify the url of the web page using the paths argument. If we can access the web page, paths_allowed() will return TRUE, else FALSE.
Since it has returned TRUE, let us go ahead and download the web page using read_html() from the xml2 package and store it in top_phones. We do this to ensure not to make repeated requests to the website which may lead to our IP address being blocked.
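Put together, those two steps might look like this; the URL below is a placeholder, not the exact page used in this post:

```r
library(robotstxt)
library(xml2)

# check whether the page may be scraped; returns TRUE or FALSE
paths_allowed(paths = "https://www.example.com/bestsellers/")

# download the page once and reuse the stored copy, so that we do not
# make repeated requests to the website
top_phones <- read_html("https://www.example.com/bestsellers/")
```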
The first detail we want to extract is the brand name of the phone. If you look at the HTML code, it is nested within a hyperlink, defined by <a>. The link is inside a section identified by the class crwTitle. We will use this information to ask rvest to extract text content which will give us the brand name.
The location is specified using html_nodes() and the text is extracted using html_text(). Since crwTitle is a class, we prefix it with a dot (.), but not a, as it is an HTML tag. Both the class and the tag are specified within quotes and separated by a space.
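The pattern can be seen with a small, self-contained snippet; the markup below is a stand-in mimicking the structure described above, not Amazon's actual HTML:

```r
library(rvest)
library(xml2)

# invented markup in the shape described above
page <- read_html('
  <div class="crwTitle"><a href="#">Samsung Galaxy M30 (Blue, 4GB RAM)</a></div>
  <div class="crwTitle"><a href="#">Redmi Note 7 Pro (Black, 6GB RAM)</a></div>
')

# "." before the class name, a plain name for the tag, separated by a space
titles <- page %>%
  html_nodes(".crwTitle a") %>%
  html_text()

titles
```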
If you observe the output, it includes the following:
To extract the brand name, we will use str_split() from stringr and specify the pattern \\(, i.e. split the string at the first opening bracket. Since ( is a special character, we use \\ for escaping. Next, we use map_chr() from the purrr package to extract the first element from the resulting list. Finally, we remove the white space using str_trim().
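As a sketch, with a couple of made-up titles standing in for the scraped ones:

```r
library(stringr)
library(purrr)
library(magrittr)

# stand-ins for the scraped titles
titles <- c("Samsung Galaxy M30 (Blue, 4GB RAM)",
            "Redmi Note 7 Pro (Black, 6GB RAM)")

brand <- titles %>%
  str_split(pattern = "\\(") %>%  # split at the first opening bracket
  map_chr(1) %>%                  # keep the part before the bracket
  str_trim()                      # drop the trailing white space

brand
```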
The whole point of the above exercise is to show that extracting the data using rvest is just one part of web scraping. We need to spend enough time tidying and reshaping the data to get it into a format useful for data analysis.
In the previous step, we observed that the data extracted from top_mobiles included the color of the mobile as well. The location of the color in the HTML is the same: within the hyperlink of the crwTitle section. But now, we want to extract the color and not the brand name.
We will split the original string at the opening bracket ( and extract the second part, which includes:
The color is separated from the rest by a comma. We will use the comma to split the string and extract the color using map_chr(), i.e. extract the first element from the resulting list.
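Using the same made-up titles as before, the color extraction might be sketched as:

```r
library(stringr)
library(purrr)
library(magrittr)

# stand-ins for the scraped titles
titles <- c("Samsung Galaxy M30 (Blue, 4GB RAM)",
            "Redmi Note 7 Pro (Black, 6GB RAM)")

color <- titles %>%
  str_split("\\(") %>%
  map_chr(2) %>%   # part after the bracket, e.g. "Blue, 4GB RAM)"
  str_split(",") %>%
  map_chr(1)       # text before the first comma is the color

color
```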
Let us extract the ratings for the phones now. If you look at the HTML code, we can locate rating within the following:
It is wrapped within a <span> identified by the class .a-icon-alt, which is inside a hyperlink (<a>) in the section identified by the class .crwProductDetail.
In the output, you can observe the text out of 5 stars for each rating. Let us get rid of this text by selecting the first 3 characters using str_sub(). We pick the first 3 characters using the start and end arguments, supplying them the values 1 and 3. Finally, we convert the rating to a number using as.numeric().
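A quick sketch with invented rating strings:

```r
library(stringr)

# stand-ins for the scraped rating text
rating_text <- c("4.2 out of 5 stars", "4.4 out of 5 stars")

# keep characters 1 to 3, then convert to a number
rating <- as.numeric(str_sub(rating_text, start = 1, end = 3))
rating
```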
Now that we know the rating for each of the top 10 best selling smart phones, let us find out how many people have reviewed them. The number of reviews is located within the following:
- a hyperlink identified by the class .a-link-normal
- a <span> tag identified by the class .a-size-small
- the section identified by the class .crwProductDetail

We use the above information within html_nodes() to extract the data. Now let us clean it up a bit and convert it into a number instead of leaving it as a character. If you use as.numeric() directly, you will see NA in the result, the reason being the presence of a comma in the number of reviews. First, we need to get rid of the comma, which we will do using str_replace(). We replace the comma with nothing, as shown in the code below, and then convert it into a number.
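With a couple of invented review counts, the cleanup looks like this:

```r
library(stringr)

# stand-ins for the scraped review counts
reviews_text <- c("1,234", "567")

# as.numeric("1,234") would give NA, so drop the comma first
reviews <- as.numeric(str_replace(reviews_text, ",", ""))
reviews
```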
The price is one of the most important factors when it comes to choosing a smart phone. Let us look at the prices of the best selling mobile phones. Again, looking at the HTML code, the price can be located within the following:
- a <span> tag identified by the class .a-text-strike
- inside the sections identified by the classes .crwPrice and .crwProductDetail

Using the above information, we can extract the price of the mobile phones. It is returned as a character vector, but we need to convert it to numeric if we are to analyze it further. Let us convert the price to a number using the following steps:
- str_trim() to remove the white spaces
- str_sub()
- str_replace()
- str_split()
- map_chr()
- as.numeric() to convert to a number

Deep discounts are one of the strategies adopted by ecommerce firms to drive sales. Let us look at the actual price (the price after discount) of the best selling mobile phones. The discounted price can be located within the following:
- a <span> tag identified by the class .crwActualPrice
- inside the sections identified by the classes .crwPrice and .crwProductDetail

Using the above information, we can extract the discounted price of the mobile phones. Let us convert the price to a number using the same steps as in the case of the real price.
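A simplified sketch of this cleaning pipeline, assuming the prices come back in a "Rs. 9,999"-style format (the actual format on the page may differ, which is where str_split() and map_chr() would come in):

```r
library(stringr)
library(magrittr)

# invented raw prices in the shape described above
price_text <- c("  Rs. 14,999  ", "  Rs. 7,999  ")

price <- price_text %>%
  str_trim() %>%               # remove surrounding white space
  str_sub(start = 5) %>%       # drop the "Rs. " prefix
  str_replace(",", "") %>%     # remove the thousands separator
  as.numeric()

price
```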
In this case study, we will extract the following details of the top 50 movies from the IMDB website:
As we did in the previous case study, we will look at the HTML code of the IMDB web page and locate the title of the movies in the following way:
- a <h3> tag
- the section identified by the class .lister-item-content

In other words, the title of the movie is inside a hyperlink (<a>) which is inside a level 3 heading (<h3>) within a section identified by the class .lister-item-content.
The year in which a movie was released can be located in the following way:
- a <span> tag identified by the class .lister-item-year
- inside a level 3 heading (<h3>)
- within the section identified by the class .lister-item-content

If you look at the output, the year is enclosed in round brackets and is a character vector. We need to do 2 things now:
- remove the round brackets
- store the year as a Date instead of character

We will use str_sub() to extract the year and convert it to Date using as.Date() with the format %Y. Finally, we use year() from the lubridate package to extract the year from the previous step.
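Sketched with two invented year strings (note: as.Date() needs a full date, so a month and day are appended here; the format = "%Y" shortcut used in the post relies on platform defaults for the missing fields):

```r
library(stringr)
library(lubridate)

# stand-ins for the scraped year text
year_text <- c("(1994)", "(1972)")

# characters 2 to 5 are the four digits between the brackets
yr <- str_sub(year_text, start = 2, end = 5)

# build a full date so the conversion is unambiguous
release_date <- as.Date(paste0(yr, "-01-01"))

# extract the year with lubridate
year(release_date)
```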
The certificate given to the movie can be located in the following way:
- a <span> tag identified by the class .certificate
- inside a paragraph (<p>)
- within the section identified by the class .lister-item-content

The runtime of the movie can be located in the following way:
- a <span> tag identified by the class .runtime
- inside a paragraph (<p>)
- within the section identified by the class .lister-item-content

The genre of the movie can be located in the following way:
- a <span> tag identified by the class .genre
- inside a paragraph (<p>)
- within the section identified by the class .lister-item-content

The rating of the movie can be located in the following way:
- the section identified by the class .ratings-imdb-rating
- within the section identified by the class .ratings-bar

Since the rating is returned as a character vector, we will use as.numeric() to convert it into a number.
To extract votes from the web page, we will use a different technique. In this case, we will use xpath and attributes to locate the total number of votes received by the top 50 movies.
xpath is specified using the following:
In case of votes, they are the following:
- tag: meta
- attribute: itemprop
- value: ratingCount

Next, we are not looking to extract a text value as we did in the previous examples using html_text(). Here, we need to extract the value assigned to the content attribute within the <meta> tag using html_attr().
Finally, we convert the votes to a number using as.numeric().
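The idea can be demonstrated offline with a stand-in for the IMDB markup; the snippet below is invented, and only the tag/attribute pattern matches the description above:

```r
library(rvest)
library(xml2)

# invented markup with the meta/itemprop/ratingCount pattern
page <- read_html('
  <p><meta itemprop="ratingCount" content="2345678"></p>
')

votes <- page %>%
  html_nodes(xpath = "//meta[@itemprop='ratingCount']") %>%
  html_attr("content") %>%   # read the attribute instead of the text
  as.numeric()

votes
```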
We wanted to extract both revenue and votes without using xpath but the way in which they are structured in the HTML code forced us to use xpath to extract votes. If you look at the HTML code, both votes and revenue are located inside the same tag with the same attribute name and value i.e. there is no distinct way to identify either of them.
In case of revenue, the xpath details are as follows:
- tag: <span>
- attribute: name
- value: nv

Next, we will use html_text() to extract the revenue.
To extract the revenue as a number, we need to do some string hacking as follows:
- coerce the extracted value using as.character()
- remove the $ and M
- convert the result to a number using as.numeric()

In this case study, we will extract the following details of the 50 most visited websites in the world:
Let us look at the code and the output first.
Surprising, right? Whenever data is structured as a table in HTML, you can specify the table tag in html_nodes() and it will return all the tables in the web page, after which you can use html_table() to extract and convert a table to a data.frame in R.
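Here is the same idea on a tiny, invented table:

```r
library(rvest)
library(xml2)

# an invented two-column table
page <- read_html('
  <table>
    <tr><th>Site</th><th>Category</th></tr>
    <tr><td>Google</td><td>Search engines</td></tr>
    <tr><td>YouTube</td><td>Video sharing</td></tr>
  </table>
')

sites <- page %>%
  html_node("table") %>%
  html_table()     # returns the table as a data.frame

sites
```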
Since the names of the columns are very long, we have renamed them to be concise and descriptive.
Now, let us look at the categories to which these top 50 websites belong using count() from dplyr. We will sort the result in descending order using the sort argument and assign it the value TRUE.
Let us club some of them to remove the sub categories.
Let us calculate the % of these categories and plot them.
In this case study, we are going to extract the list of RBI (Reserve Bank of India) Governors. The author of this blog post comes from an Economics background and as such was interested in knowing the professional background of the Governors prior to their taking charge at India's central bank.
The data in the Wikipedia page is luckily structured as a table and we can extract it using html_table(). There are 2 tables in the web page and we are interested in the second table. Using extract2() from the magrittr package, we will extract the table containing the details of the Governors.
Let us arrange the data by number of days served. The Term in office column contains this information but it also includes the text days. Let us split this column into two columns, term and days, using separate() from tidyr and then select the columns Officeholder and term and arrange it in descending order using desc().
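On a toy version of the table (the names and numbers below are invented), the reshaping looks like this:

```r
library(tibble)
library(dplyr)
library(tidyr)

# invented stand-in for the Wikipedia table
governors <- tibble(
  Officeholder     = c("Governor A", "Governor B"),
  `Term in office` = c("2193 days", "1095 days")
)

tenure <- governors %>%
  separate(`Term in office`, into = c("term", "days"), sep = " ") %>%
  mutate(term = as.numeric(term)) %>%   # "2193" -> 2193
  select(Officeholder, term) %>%
  arrange(desc(term))

tenure
```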
What we are really interested in is the background of the Governors. Use count() from dplyr to look at the background of the Governors and the respective counts.
Let us club some of the categories into Bureaucrats as they belong to the Indian Administrative/Civil Services. The missing data will be renamed as No Info. The category Career Reserve Bank of India officer is renamed as RBI Officer to make it more concise.
Hmmm.. So there were more bureaucrats than economists.
To get in depth knowledge of R & data science, you can enroll here for our free online R courses.